CELTYC 0.1.0
Researchers have been proposing cancer taxonomies using omic profiles generated in bulk-tissue, and thus, whereas classifications based on somatic mutations and copy-number alterations reflect the underlying patterns of genomic alterations in the tumor cell-of-origin, classifications derived from bulk transcriptomic and/or epigenetic data are inevitably confounded by cell-type heterogeneity (CTH). Moreover, CTH means that it is much harder to pinpoint whether specific cancer-associated transcriptomic or epigenetic changes are happening in the tumor cell-of-origin and not in tumor stroma. Therefore, it is critical to explore novel classifications of tumor-types in terms of the cell-type specific transcriptomic and epigenetic changes for improved understanding of how distinct cancer subtypes emerge in relation to the functional changes that happen in the tumor cell of origin and in the various types of tumor-associated stromal cells.
Here we propose a novel strategy, based on the concept of “cell-type specific clustering” (CELTYC), to refine the molecular classification of cancer-types. The key innovative idea behind this proposal is that we can identify the features (CpGs/genes) driving disease-relevant cell-type specific variation, so that clustering over these features results in disease subtypes that are not confounded by variations in cell-type composition. We evaluate the above CELTYC strategy in the context of DNA methylation (DNAm) data. By applying our novel strategy to liver hepatocellular carcinoma (LIHC) DNAm data, we reveal novel biologically and clinically relevant tumor classifications.
Basically, CELTYC workflow includes:
In order to run the tutorial we must first load the necessary package and data. We have saved all the necessary data objects in a list named LIHC_data.
knitr::opts_chunk$set(crop = NULL)
library(CELTYC)
data("LIHC_data") # load the data used in the following analysis
The first step of CELTYC is to estimate cell-type fractions from
bulk DNA methylation (DNAm) data. Here we use the normalized TCGA LIHC 450k DNAm data bmiq.m, which is a matrix with rows labeling CpGs and columns labeling samples (including both cancer and normal samples) for demonstration. To estimate cell-type fractions (CTFs), we use an R package EpiSCORE which provides a DNAm reference matrix for the liver tissue. Because the full LIHC data bmiq.m is too large for inclusion here, we just display the syntax:
library(EpiSCORE)
avDNAm.m <- constAvBetaTSS(bmiq.m,type="450k") #computing the average DNAm over a window 200bp upstream of TSS, or if not available over 1st exon probes.
estF.o <- wRPC(avDNAm.m,ref=mrefLiver.m,useW=TRUE,wth=0.4,maxit=500) # Estimating cell-type fraction with a DNAm reference matrix for liver via weighted robust linear regression
estF.m <- estF.o$estF
estF.m is the estimated cell-type fraction matrix for all samples in bmiq.m. It contains fractions of 5 cell types in the liver tissue: cholangiocytes (Chol), endothelial cells (EC), hepatocytes (Hep), Kupffer cells (Kup), lymphocytes (Lym).
Next, we aim to identify cancer-associated DMCTs employing the previously published CellDMC algorithm. We can apply the function EpiDISH::CellDMC to identify DMCTs, which requires the input of the abovementioned matrix bmiq.m, and the corresponding phenotype information (pheno.df, a data.frame with rows that match the columns in bmiq.m). The output of EpiDISH::CellDMC can then be input into the function GetDMCT, which further extracts DMCT sets altered in each of the different cell types. By setting the parameter “nDMCT” to be 100, we can retrieve only DMCT sets containing at least 100 DMCTs for a cell type. This criterion helps prevent the selection of a very limited number of DMCTs, which would be inadequate for performing cell-type specific clustering. Additionally, we can retrieve DMCTs exclusively altered in each cell type by setting the “specific” parameter to TRUE. For retrieving DMCTs common to multiple cell types, we set the “common” parameter to TRUE and specify a value for parameter “nCT”, representing the number of cell types to consider for common DMCTs. Again here we only display the syntax:
library(EpiDISH)
phe.v <- pheno.df$cancer # a vector indicating whether a sample is "Cancer" or "Normal"
phe.v[which(phe.v=="Cancer")] <- 1
phe.v[which(phe.v=="Normal")] <- 0
phe.v <- as.numeric(phe.v)
sex.v <- pheno.df$gender
sex.v[which(sex.v=="MALE")] <- 1
sex.v[which(sex.v=="FEMALE")] <- 2
sex.v <- as.numeric(sex.v)
age.v <- pheno.df$age_at_initial_pathologic_diagnosis
na.idx <- which(is.na(age.v)|is.na(sex.v)) # the running of CellDMC does not allow NA in input
cov.mod.use <- model.matrix(~sex.v[-na.idx]+age.v[-na.idx]) # exclude the confounding effect of age and sex
celldmc.o <- CellDMC(beta.m = bmiq.m[,-na.idx], pheno.v=phe.v[-na.idx], frac.m = estF.m[-na.idx,], mc.cores = 40,cov.mod = cov.mod.use)
dmct.lv <- GetDMCT(celldmc.o,nDMCT = 100,specific = T,nSpDMCT = 100,common = T,nCT = 3,nCmDMCT = 100) # nSpDMCT is the threshold to save DMCTs specific to a particular cell type, and nCmDMCT is the threshold to save DMCTs common to nCT cell types
We have saved the returned result of GetDMCT (i.e. the abovementioned dmct.lv) as LIHC_data$allDMCT. LIHC_data$allDMCT is a list object containing three sublists. The first sublist named “dmct” contains the DMCT sets altered in each cell type. The second sublist named “spec” contains the DMCT sets exclusively altered in each cell type (i.e. specific to a cell type). The third sublist named “comm” contains the DMCT sets common to various combinations of three cell types. To visualize the number of DMCTs for each cell type and their overlaps, we can run:
library(SuperExactTest)
supertest.o <- supertest(LIHC_data$allDMCT$dmct,n=403197)
plot(supertest.o,"landscape",sort.by = "size",color.scale.cex = 0.8,overlap.size.cex = 0.7,cex=0.8)
As we can see from the figure above, most DMCTs occur in lymphocytes (Lym), hepatocytes (Hep) and endothelial cells (EC). Therefore, we focus on DMCTs altered in these 3 cell types in the subsequent analyses.
Now that we have identified DMCTs, we can perform cell-type specific clustering (CELTYC) on DNAm data over different sets of DMCTs for cancer samples. To accomplish this, we will apply the GenResidualMat function to regress out the estimated cell-type fractions (CTFs) from the input DNAm matrix and then z-score standardize the resulting residual matrix. Afterward, we will use the DoCELTYC function to conduct consensus clustering directly on the standardized residual matrix over the specified DMCT sets, setting the “method” parameter to “consensus”.
Here we can use the prepared DNAm data matrix LIHC_data$DNAm for LIHC cancer samples and the corresponding cell-type fraction matrix LIHC_data$estF. We can then do clustering restricting to 3 sets of DMCTs (stored in LIHC_data$allDMCT$spec): DMCTs exclusively altered in lymphocytes, hepatocytes and endothelial cells respectively.
res.m <- GenResidualMat(LIHC_data$DNAm,estCTF.m = LIHC_data$estF,standardize = T,ncores = 40) # regress out CTFs from DNAm matrix and get standardized residual matrix
CELTYC.results.l <- DoCELTYC(res.m,method = "consensus",maxK = 3,dmct.lv = LIHC_data$allDMCT$spec[c("Lym","Hep","EC")],title = "cluster-test") # do consensus clustering on standardized residual matrix over DMCTs specific to lymphocytes, hepatocytes and endothelial cells
## Start doing consensus clustering on standardized residual matrix over input DMC list.
## end fraction
## clustered
## clustered
## end fraction
## clustered
## clustered
## end fraction
## clustered
## clustered
We can check whether the clusters obtained by using DMCTs specific to different cell types are different. We can look at the pre-generated heatmap LIHC_data$heatmap displaying the standardized residual matrix annotated by the clustering assignments obtained by using DMCTs specific to different cell types:
library(ComplexHeatmap)
lym.clust.v <- CELTYC.results.l$Lym[[3]]$consensusClass
hep.clust.v <- CELTYC.results.l$Hep[[3]]$consensusClass
ec.clust.v <- CELTYC.results.l$EC[[2]]$consensusClass # if the chosen cluster number is 3, the 3rd cluster contains few samples
print("Compare the lym-clusters and hep-clusters:")
## [1] "Compare the lym-clusters and hep-clusters:"
table(lym.clust.v,hep.clust.v)
## hep.clust.v
## lym.clust.v 1 2 3
## 1 60 54 39
## 2 33 45 12
## 3 58 38 40
print("Compare the lym-clusters and EC-clusters:")
## [1] "Compare the lym-clusters and EC-clusters:"
table(lym.clust.v,ec.clust.v)
## ec.clust.v
## lym.clust.v 1 2
## 1 143 10
## 2 88 2
## 3 21 115
print("Compare the hep-clusters and EC-clusters:")
## [1] "Compare the hep-clusters and EC-clusters:"
table(hep.clust.v,ec.clust.v)
## ec.clust.v
## hep.clust.v 1 2
## 1 98 53
## 2 104 33
## 3 50 41
ComplexHeatmap::draw(LIHC_data$heatmap)
From the above results, we can see that the sample compositions differ between hepatocyte and lymphocyte clusters, as well as between hepatocyte and endothelial clusters. One of the endothelial clusters is dominantly enriched in one of the lymphocyte clusters. We then explore whether clusters obtained with DMCTs specific to different cell types are associated with clinical outcome.
In particular, we want to see whether the newly obtained clusters using different sets of DMCTs are significantly associated with overall survival (OS). We use the prepared data.frame object LIHC_data$pheno storing the phenotype information including OS time and death event:
library(survival)
# for clustering results using DMCTs for different cell types:
clust.l <- list(Lym=lym.clust.v,Hep=hep.clust.v,EC=ec.clust.v)
surv.res.l <- list()
for(m in 1:3){
clust.v <- clust.l[[m]]
chi.surv.p.v <- vector()
tmp.v <- vector()
count <- length(unique(na.omit(clust.v)))
cl <- sort(unique(clust.v))
# generate pairwise P values comparing survival probability between different clusters
for(i in 1:(count-1)){
for(j in (i+1):count){
tmp.time <- LIHC_data$pheno$OS.time[clust.v %in% c(cl[i],cl[j])]
tmp.event <- LIHC_data$pheno$event[clust.v %in% c(cl[i],cl[j])]
tmp.clust.v <- clust.v[clust.v %in% c(cl[i],cl[j])]
surv.o <- Surv(time = tmp.time,event = tmp.event)
cox.o <- coxph(surv.o~tmp.clust.v)
chisq.pval <- pchisq(cox.o$score,1,lower.tail = F)
chi.surv.p.v <- c(chi.surv.p.v,chisq.pval)
tmp.v <- c(tmp.v,paste0("cl ",cl[i],",",cl[j]))
}
}
names(chi.surv.p.v) <- tmp.v
# plot Kaplan Meier curves
surv.o <- Surv(time = LIHC_data$pheno$OS.time,event = LIHC_data$pheno$event)
survfit.o <- survfit(surv.o ~ clust.v)
surv.res.l[[m]] <- list()
surv.res.l[[m]][["pair-Pval"]] <- chi.surv.p.v
surv.res.l[[m]][["survfit"]] <- survfit.o
}
par(mar=c(2,2,2,2))
# for clustering results using lym DMCTs:
plot(surv.res.l[[1]]$survfit, col = c("firebrick","dodgerblue","orange"),lwd=2, mark.time=TRUE, xlab="Years", ylab="OS",xscale = 365.25,cex.lab=0.8,cex.axis=0.8,mgp = c(1, 0.5, 0),tck = -0.03,main="Lym clusters",cex.main=1)
text(x = 1400,y=0.1,label=paste0(names(surv.res.l[[1]]$`pair-Pval`)[1]," Chisq P=",signif(surv.res.l[[1]]$`pair-Pval`[1],2)),cex = 0.9)
text(x = 1400,y=0.15,label=paste0(names(surv.res.l[[1]]$`pair-Pval`)[2]," Chisq P=",signif(surv.res.l[[1]]$`pair-Pval`[2],2)),cex = 0.9)
text(x = 1400,y=0.2,label=paste0(names(surv.res.l[[1]]$`pair-Pval`)[3]," Chisq P=",signif(surv.res.l[[1]]$`pair-Pval`[3],2)),cex = 0.9)
legend("topright",legend=paste0("cl",1:3),col=c("firebrick","dodgerblue","orange"),
lty = 1,inset = 0.05,cex = 0.9,bty="n")
# for clustering results using Hep DMCTs:
plot(surv.res.l[[2]]$survfit, col = c("#FF8080","#80FF80","#8080FF"),lwd=2, mark.time=TRUE, xlab="Years", ylab="OS",xscale = 365.25,cex.lab=0.8,cex.axis=0.8,mgp = c(1, 0.5, 0),tck = -0.03,main="Hep clusters",cex.main=1)
text(x = 1400,y=0.1,label=paste0(names(surv.res.l[[1]]$`pair-Pval`)[1]," Chisq P=",signif(surv.res.l[[2]]$`pair-Pval`[1],2)),cex = 0.9)
text(x = 1400,y=0.15,label=paste0(names(surv.res.l[[1]]$`pair-Pval`)[2]," Chisq P=",signif(surv.res.l[[2]]$`pair-Pval`[2],2)),cex = 0.9)
text(x = 1400,y=0.2,label=paste0(names(surv.res.l[[1]]$`pair-Pval`)[3]," Chisq P=",signif(surv.res.l[[2]]$`pair-Pval`[3],2)),cex = 0.9)
legend("topright",legend=paste0("cl",1:3),col=c("#FF8080","#80FF80","#8080FF"),
lty = 1,inset = 0.05,cex = 0.9,bty="n")
# for clustering results using EC DMCTs:
plot(surv.res.l[[3]]$survfit, col = c("#F06000","#408030"),lwd=2, mark.time=TRUE, xlab="Years", ylab="OS",xscale = 365.25,cex.lab=0.8,cex.axis=0.8,mgp = c(1, 0.5, 0),tck = -0.03,main="EC clusters",cex.main=1)
text(x = 1400,y=0.1,label=paste0(names(surv.res.l[[3]]$`pair-Pval`)[1]," Chisq P=",signif(surv.res.l[[2]]$`pair-Pval`[1],2)),cex = 0.9)
legend("topright",legend=paste0("cl",1:2),col=c("#F06000","#408030"),
lty = 1,inset = 0.05,cex = 0.9,bty="n")
The results we obtain show that CELTYC enables us to divide LIHC samples into different clusters with clear segregation of overall survival when we use DMCTs specific to lymphocytes. However, no distinguishable segregation was observed when employing DMCTs specific to the other two cell types.
In the example above we introduce doing CELTYC by setting the parameter “method” of DoCELTYC function to be “consensus”. However, we can also implement a statistical procedure called JIVE (Joint and Individual Variation Explained) before doing clustering. By setting the “method” in DoCELTYC function to be “jive”, we can do jive analysis on the input data matrix (“data.m” in DoCELTYC) first. To conduct jive analysis in DoCELTYC, one needs to input a DMCT list for parameter “dmct.lv”, where there are no overlapped features between every two entries in “dmct.lv”. If we input 4 DMCT sets here: 3 DMCT sets unique to one cell type, and one DMCT set shared by all three cell types, “jive” method will help extract out a joint variation (JV) matrix representing variation common to all 4 DMCT sets, and 4 individual variation (IV) matrices representing variation that are unique to each of the input DMCT sets, utilizing an R-package r.jive. Then clustering will be performed on the obtained JV and IV matrices respectively. The jive-version DoCELTYC will return a list saving the jive analysis results as well as the clustering results.
In this case we can use the prepared list LIHC_data$allDMCT which stores the DMCTs specific to lymphocytes, hepatocytes and endothelials, as well as DMCTs shared by all three cell types. To save the running time, we only display the syntax for performing CELTYC with “jive” method on the standardized residual matrix generated in the previous section for LIHC cancer samples:
selDMCT.lv <- c(LIHC_data$allDMCT$spec[c("Lym","Hep","EC")],LIHC_data$allDMCT$comm["EC_Hep_Lym"]) # extract the cell-type specific DMCTs and the common DMCTs
jive.results.l <- DoCELTYC(data.m = res.m,method = "jive",maxK = 3,dmct.lv = selDMCT.lv,title = "cluster-test")
jive.IV.lym.clust.l <- jive.results.l$Lym # extract the clustering results on JIVE-derived individual variation matrix for lymphocyte-specific DMCTs
We have saved the clustering results obtained from the JIVE-derived individual variation matrix for lymphocyte-specific DMCTs (i.e. the abovementioned jive.IV.lym.clust.l) as LIHC_data$jive_IV_lym, and now we can check the correlation between the clusters obtained with “jive” method and overall survival.
jive.clust.v <- LIHC_data$jive_IV_lym[[2]]$consensusClass # extract the cluster assignments when cluster number is 2
surv.o <- Surv(time = LIHC_data$pheno$OS.time,event = LIHC_data$pheno$event)
cox.o <- coxph(surv.o~jive.clust.v)
chisq.pval <- pchisq(cox.o$score,1,lower.tail = F)
survfit.o <- survfit(surv.o ~ jive.clust.v)
par(mar=c(2,2,2,2))
plot(survfit.o, col = c("coral","#CF30CF"),lwd=2, mark.time=TRUE, xlab="Years", ylab="OS",xscale = 365.25,cex.lab=0.8,cex.axis=0.8,mgp = c(1, 0.5, 0),tck = -0.03,main="JIVE IV Lym clusters",cex.main=1)
text(x = 1400,y=0.1,label=paste0("cl 1,2 Chisq P=",signif(chisq.pval,2)),cex = 0.9)
legend("topright",legend=paste0("cl",1:2),col=c("coral","#CF30CF"),
lty = 1,inset = 0.05,cex = 0.9,bty="n")
From this Kaplan Meier curve plot, we can see that by doing clustering on individual variation matrix for lymphocyte specific DMCTs, we can also obtain clusters distinctly segregated in terms of clinical outcome.
sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: CentOS Linux 7 (Core)
##
## Matrix products: default
## BLAS: /usr/local/lib64/R/4.4.4/lib64/R/lib/libRblas.so
## LAPACK: /usr/local/lib64/R/4.4.4/lib64/R/lib/libRlapack.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Asia/Shanghai
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] survival_3.6-4 ComplexHeatmap_2.19.0 SuperExactTest_1.1.0
## [4] CELTYC_0.1.0 BiocStyle_2.31.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 gplots_3.1.3.1
## [3] ConsensusClusterPlus_1.58.0 bitops_1.0-7
## [5] KernSmooth_2.23-22 shape_1.4.6.1
## [7] gtools_3.9.5 lattice_0.22-6
## [9] magrittr_2.0.3 digest_0.6.35
## [11] caTools_1.18.2 RColorBrewer_1.1-3
## [13] evaluate_0.23 bookdown_0.39
## [15] iterators_1.0.14 circlize_0.4.16
## [17] fastmap_1.1.1 Matrix_1.7-0
## [19] foreach_1.5.2 doParallel_1.0.17
## [21] jsonlite_1.8.8 GlobalOptions_0.1.2
## [23] BiocManager_1.30.22 codetools_0.2-20
## [25] jquerylib_0.1.4 abind_1.4-5
## [27] cli_3.6.2 rlang_1.1.3
## [29] crayon_1.5.2 Biobase_2.63.1
## [31] splines_4.4.0 cachem_1.0.8
## [33] yaml_2.3.8 tools_4.4.0
## [35] parallel_4.4.0 colorspace_2.1-0
## [37] BiocGenerics_0.49.1 GetoptLong_1.0.5
## [39] R6_2.5.1 png_0.1-8
## [41] magick_2.8.3 stats4_4.4.0
## [43] matrixStats_1.3.0 lifecycle_1.0.4
## [45] S4Vectors_0.41.7 IRanges_2.37.1
## [47] clue_0.3-65 cluster_2.1.6
## [49] bslib_0.7.0 Rcpp_1.0.12
## [51] highr_0.10 xfun_0.43
## [53] r.jive_2.4 rstudioapi_0.16.0
## [55] knitr_1.46 rjson_0.2.21
## [57] htmltools_0.5.8.1 rmarkdown_2.26
## [59] Cairo_1.6-2 compiler_4.4.0